{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# LightGBM Parameter Tuning for Otto Dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们以Kaggle 2015年举办的Otto Group Product Classification Challenge竞赛数据为例，进行XGBoost参数调优探索。\n",
    "\n",
    "竞赛官网：https://www.kaggle.com/c/otto-group-product-classification-challenge/data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 第二步：调整树的参数：num_leaves\n",
    "(粗调，参数的步长为10；下一步是在粗调最佳参数周围，将步长降为2，进行精细调整)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "首先 import 必要的模块"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import lightgbm as lgbm\n",
    "\n",
    "import pandas as pd \n",
    "import numpy as np\n",
    "\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "\n",
    "from sklearn.metrics import log_loss\n",
    "\n",
    "from matplotlib import pyplot\n",
    "import seaborn as sns\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 读取数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# path to where the data lies\n",
    "dpath = './data/'\n",
    "train = pd.read_csv(dpath +\"Otto_train.csv\")\n",
    "#train.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Variable Identification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "选择该数据集是因为的数据特征单一，我们可以在特征工程方面少做些工作，集中精力放在参数调优上"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Target 分布，看看各类样本分布是否均衡"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true,
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "#sns.countplot(train.target);\n",
    "#pyplot.xlabel('target');\n",
    "#pyplot.ylabel('Number of occurrences');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "每类样本分布不是很均匀，所以交叉验证时也考虑各类样本按比例抽取"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# drop ids and get labels\n",
    "y_train = train['target']\n",
    "y_train = y_train.map(lambda s: s[6:])\n",
    "y_train = y_train.map(lambda s: int(s)-1)\n",
    "\n",
    "train = train.drop([\"id\", \"target\"], axis=1)\n",
    "X_train = np.array(train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "第一轮参数调整得到的n_estimators最优值（539），其余参数继续默认值"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "用交叉验证评价模型性能时，用scoring参数定义评价指标。评价指标是越高越好，因此用一些损失函数当评价指标时，需要再加负号，如neg_log_loss，neg_mean_squared_error 详见sklearn文档：http://scikit-learn.org/stable/modules/model_evaluation.html#log-loss"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'num_leaves': [40, 50, 60]}"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "#num_leaves \n",
    "num_leaves = [40,50,60]\n",
    "param_test2_1 = dict(num_leaves=num_leaves)\n",
    "param_test2_1"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "# prepare cross validation\n",
    "from sklearn.model_selection import StratifiedKFold\n",
    "kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/Users/qing/anaconda2/lib/python2.7/site-packages/lightgbm/sklearn.py:282: LGBMDeprecationWarning: The `seed` parameter is deprecated and will be removed in next version. Please use `random_state` instead.\n",
      "  'Please use `random_state` instead.', LGBMDeprecationWarning)\n",
      "/Users/qing/anaconda2/lib/python2.7/site-packages/lightgbm/sklearn.py:285: LGBMDeprecationWarning: The `nthread` parameter is deprecated and will be removed in next version. Please use `n_jobs` instead.\n",
      "  'Please use `n_jobs` instead.', LGBMDeprecationWarning)\n"
     ]
    }
   ],
   "source": [
    "params = {'boosting_type': 'gbdt', \n",
    "          'objective': 'multiclass', \n",
    "          'nthread': -1, \n",
    "          'silent': True,\n",
    "          'learning_rate': 0.1, \n",
    "          'max_depth': 6,\n",
    "          'max_bin': 127, \n",
    "          'subsample_for_bin': 50000,\n",
    "          'subsample': 0.8, \n",
    "          'subsample_freq': 1, \n",
    "          'colsample_bytree': 0.8, \n",
    "          'reg_alpha': 1, \n",
    "          'reg_lambda': 0,\n",
    "          'min_split_gain': 0.0, \n",
    "          'min_child_weight': 1, \n",
    "          'min_child_samples': 20, \n",
    "          'scale_pos_weight': 1}\n",
    "\n",
    "lgbm2_1 = lgbm.sklearn.LGBMClassifier(num_class= 9, n_estimators=539, seed=0, **params)\n",
    "\n",
    "gsearch2_1 = GridSearchCV(lgbm2_1, param_grid = param_test2_1, scoring='neg_log_loss',n_jobs=-1, cv=kfold)\n",
    "gsearch2_1.fit(X_train , y_train)\n",
    "\n",
    "gsearch2_1.grid_scores_, gsearch2_1.best_params_,     gsearch2_1.best_score_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "gsearch2_1.cv_results_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# summarize results\n",
    "print(\"Best: %f using %s\" % (gsearch2_1.best_score_, gsearch2_1.best_params_))\n",
    "test_means = gsearch2_1.cv_results_[ 'mean_test_score' ]\n",
    "test_stds = gsearch2_1.cv_results_[ 'std_test_score' ]\n",
    "train_means = gsearch2_1.cv_results_[ 'mean_train_score' ]\n",
    "train_stds = gsearch2_1.cv_results_[ 'std_train_score' ]\n",
    "\n",
    "pd.DataFrame(gsearch2_1.cv_results_).to_csv('my_preds_num_leaves_1.csv')\n",
    "\n",
    "# plot results\n",
    "pyplot.plot(num_leaves, -test_means)\n",
    "   \n",
    "pyplot.legend()\n",
    "pyplot.xlabel( 'num_leaves' )                                                                                                      \n",
    "pyplot.ylabel( 'Log Loss' )\n",
    "pyplot.savefig('num_leaves.png' )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.14"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
